Final Project Submission

Please fill out:


Table of Contents (Links)


FINAL SUMMARY FIGURES FROM PRESENTATION

Maps were from final .csv in Tableau Public and can be viewed and downloaded from https://public.tableau.com/profile/james.irving#!/

Short how-to plot geo data in Tableau:

  • Load in your .csv dataset from your project.
    • Let it use data interpreter. It should identify zipcode as a location.
  • On your worksheet page:
    • For plotting each price for each house:
      • Drag the Measures Lat and Long onto the rows and columns boxes (top of sheet)
      • Drag the Measure price onto the Color Button under Marks.
        It should now be listed at the bottom of the Marks panel.
      • Right-click and select "Dimension"
    • For plotting median income by zipcode:
      • Drag zipcode form the Dimensions panel onto the main graph window.
        • It will automatically load in map of location.
      • Drag price onto the color button (it will now appear in the Marks window)
      • Rich click on Price. Select "Measure" > Median
    • Customize map features by selecting "Map" > Map Layers on the Menu Bar.

Outline of Data Processing and Analysis(using OSEMN model)

  1. OBTAIN: Import data, inspect, check for datatypes to convert and null values

    • Display header and info
    • Drop any unneeded columns (df.drop(['col1','col2'],axis=1)
  2. SCRUB: cast data types, identify outliers, check for multicollinearity, normalize data

    • Check and cast data types
      • [x] Check for #'s that are store as objects (df.info())
        • when converting to #'s, look for odd values (like many 0's), or strings that can't be converted
        • Decide how to deal weird/null values (df.unique(), df.isna().sum(), df.describe()-min/max, etc
      • [x] Check for categorical variables stored as integers
    • [x] Check for missing values (df.isna().sum())
      • Can drop rows or colums
      • For missing numeric data with median or bin/convert to categorical
      • For missing categorical data: make NaN own category OR replace with most common category
    • [X] Check for multicollinearity
      • use seaborn to make correlation matrix plot Evernote Link
        • Good rule of thumb is anything over 0.75 corr is high, remove the variable that has the most correl with the largest # of variables
    • [x] Normalize data (may want to do after some exploring)
      • Most popular is Z-scoring (but won't fix skew)
      • Can log-transform to fix skewed data
  1. EXPLORE:Check distributions, outliers, etc
    • [ ] Check scales, ranges (df.describe())
    • [x] Check histograms to get an idea of distributions (df.hist()) and dat transformations to perform
      • Can also do kernel density estimates
    • [x] Use scatterplots to check for linearity and possible categorical variables (df.plot(kind-'scatter')
      • categoricals will look like vertical lines
    • [x] Use pd.plotting.scatter_matrix to visualize possible relationships
    • [x] Check for linearity
  1. FIT AN INITIAL MODEL:
    • Various forms, detail later...
    • Assessing the model:
      • Assess parameters (slope,intercept)
      • Check if the model explains the variation in the data (RMSE, F, R_square)
      • Are the coeffs, slopes, intercepts in appropriate units?
      • Whats the impact of collinearity? Can we ignore?
  2. Revise the fitted model
    • Multicollinearity is big issue for lin regression and cannot fully remove it
    • Use the predictive ability of model to test it (like R2 and RMSE)
    • Check for missed non-linearity
  3. Holdout validation / Train/test split
    • use sklearn train_test_split

OBTAIN:

  • Import required packages, read in dataframe, and definefunctions
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
%matplotlib inline
inline_rc = dict(mpl.rcParams)

plt.style.use('seaborn')

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import normaltest as normtest # D'Agostino and Pearson's omnibus test

from collections import Counter
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('kc_house_data.csv')   
# Set index, create dataframe for dropped variables with id as index for both
df.set_index=('id')

drop_me =['id','lat','long','date']
df_dropped = df[drop_me].copy()
df_dropped.set_index('id')
df.drop(drop_me,axis=1,inplace=True)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:886: MatplotlibDeprecationWarning: 
examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))

NOTE TO USER: SAVED FILES.

  • The below cell determines the output filepaths and should not be altered unless you intend to change those locations
In [1]:
folder = %pwd
fig_filepath = folder+'/Figures/'
data_filepath = folder+'/Data/'
In [3]:
df.index
Out[3]:
RangeIndex(start=0, stop=21597, step=1)

Define functions to be used

def check_column(series, nlargest):

In [4]:
# Check columns returns the datatype, null values and unique values of input series 
def check_column(series,nlargest='all'):
    print(f"Column: df['{series.name}']':")
    print(f"dtype: {series.dtype}")
    print(f"isna: {series.isna().sum()} out of {len(series)} - {round(series.isna().sum()/len(series)*100,3)}%")
        
    print(f'\nUnique non-na values:') #,df['waterfront'].unique())
    if nlargest =='all':
        print(series.value_counts())
    else:
        print(series.value_counts().nlargest(nlargest))

def log_z(col):

  • Logs the column then z-scores
In [5]:
# define log + z-score
def log_z(col):
    
    col = np.array(col)
    logcol = np.log(col)
    
    zlogcol = (logcol-np.mean(logcol))/np.sqrt(np.var(logcol))
    
    return zlogcol

def rem_out_z(col_name):

  • Logs the column then z-scores and removes zscores>3
In [6]:
def rem_out_z(col_name):

    col = np.array(col_name)
    z_col = (col - np.mean(col)) / np.sqrt(np.var(col))
    z_col[abs(z_col)>3]=np.nan
    return z_col

def multiplot(df):

  • plots heat map of correlation values of input dataframe
In [7]:
# MULTIPLOT
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def multiplot(df):

    sns.set(style="white")

    # Compute the correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(16, 16))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, annot=True, cmap=cmap, center=0,
                
    square=True, linewidths=.5, cbar_kws={"shrink": .5}) #

def plot_hist_scat(df,target,stats):

In [8]:
# Plots histogram and scatter (vs price) side by side
def plot_hist_scat(df,target='price',stats=False):
#     plt.style.use('bmh')
    plt.style.use('fivethirtyeight')
    
#     fontTitle = {'family': 'serif',
#             'color':  'black',
#             'weight': 'normal',
#             'size': 16,
#             }
#     fontTicks = {'family': 'sans-serif',
#             'color':  'black',
#             'weight': 'normal',
#             'size': 10,
#             }
    results = [['column','K_square','p-val']]

    for column in df.describe():

        fig = plt.figure(figsize=(8,3) )#plt.figaspect(0.5))#(5,4))
        
        ax1 = fig.add_subplot(121)
        ax1.hist(df[column],density=True,label = column+' histogram',bins=20)
        ax1.set_title(column.capitalize())

        ax1.legend()
        
        ax2 = fig.add_subplot(122)
        ax2.scatter(x=df[column], y=df[target],label = column+' vs price',marker='.')
        ax2.set_title(column.capitalize())
        ax2.legend()

        fig.tight_layout()
        if stats==True:
            stat, p = normtest(df[column])
#             print(f'Normality test for {column}:K_square = {stat}, p-value = {p}')

            results.append([column,stat, p])
    return pd.DataFrame(results)

def plot_hist_scat_sns(df,target,stats):

In [9]:
#SEABORN
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt

# Plots histogram and scatter (vs price) side by side
def plot_hist_scat_sns(df,target='price'):
    plt.style.use('dark_background')

    
    ## ----------- DEFINE AESTHETIC CUSTOMIZATIONS ----------- ##
    # Axis Label fonts
    fontTitle = {'fontsize': 16,
               'fontweight': 'bold',
                'fontfamily':'serif'}

    fontAxis = {'fontsize': 14,
               'fontweight': 'bold',
                'fontfamily':'serif'}

    fontTicks = {'fontsize': 12,
               'fontweight':'bold',
                'fontfamily':'serif'}

    # Formatting dollar sign labels
    fmtPrice = '${x:,.0f}'
    tickPrice = mtick.StrMethodFormatter(fmtPrice)
    

    ## ----------- PLOTTING ----------- ##
    
    ## Loop through dataframe to plot
    for column in df.describe():
    
        # Create figure with subplots for current column
        # Note: in order to use identical syntax for large # of subplots (ax[i,j]), 
        #  declare an extra row of subplots to be removed later
        fig, ax = plt.subplots(figsize=(12,10), ncols=2, nrows=2)

        ## ----- SUBPLOT 1 -----##
        i,j = 0,0
        ax[i,j].set_title(column.capitalize(),fontdict=fontTitle)
        
        # Define graphing keyword dictionaries for distplot (Subplot 1)
        hist_kws = {"linewidth": 1, "alpha": 1, "color": 'blue','edgecolor':'w'}
        kde_kws = {"color": "white", "linewidth": 1, "label": "KDE"}
        
        # Plot distplot on ax[i,j] using hist_kws and kde_kws
        sns.distplot(df[column], norm_hist=True, kde=True,
                     hist_kws = hist_kws, kde_kws = kde_kws,
                     label=column+' histogram', ax=ax[i,j])
 

        # Set x axis label
        ax[i,j].set_xlabel(column.title(),fontdict=fontAxis)
    
        # Get x-ticks, rotate labels, and return
        xticklab1 = ax[i,j].get_xticklabels(which = 'both')
        ax[i,j].set_xticklabels(labels=xticklab1, fontdict=fontTicks, rotation=45)
        ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

        
        # Set y-label 
        ax[i,j].set_ylabel('Density',fontdict=fontAxis)
        yticklab1=ax[i,j].get_yticklabels(which='both')
        ax[i,j].set_yticklabels(labels=yticklab1,fontdict=fontTicks)
        ax[i,j].yaxis.set_major_formatter(mtick.ScalarFormatter())
        
        
        # Set y-grid
        ax[i, j].set_axisbelow(True)
        ax[i, j].grid(axis='y',ls='--')

        
        ## ----- SUBPLOT 2-----  ##
        i,j = 0,1
        ax[i,j].set_title(column.capitalize(),fontdict=fontTitle)

        # Define the ketword dictionaries for  scatter plot and regression line (subplot 2)
        line_kws={"color":"white","alpha":0.5,"lw":4,"ls":":"}
        scatter_kws={'s': 2, 'alpha': 0.5,'marker':'.','color':'blue'}

        # Plot regplot on ax[i,j] using line_kws and scatter_kws
        sns.regplot(df[column], df[target], 
                    line_kws = line_kws,
                    scatter_kws = scatter_kws,
                    ax=ax[i,j])
        
        # Set x-axis label
        ax[i,j].set_xlabel(column.title(),fontdict=fontAxis)

         # Get x ticks, rotate labels, and return
        xticklab2=ax[i,j].get_xticklabels(which='both')
        ax[i,j].set_xticklabels(labels=xticklab2,fontdict=fontTicks, rotation=45)
        ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

        # Set  y-axis label
        ax[i,j].set_ylabel('Price',fontdict=fontAxis)
        
        # Get, set, and format y-axis Price labels
        yticklab = ax[i,j].get_yticklabels()
        ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
        ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

        # Set y-grid
        ax[i, j].set_axisbelow(True)
        ax[i, j].grid(axis='y',ls='--')       
        
        ## ---------- Final layout adjustments ----------- ##
        # Deleted unused subplots 
        fig.delaxes(ax[1,1])
        fig.delaxes(ax[1,0])

        # Optimizing spatial layout
        fig.tight_layout()
        figtitle=column+'_dist_regr_plots.png'
        plt.savefig(fig_filepath+figtitle)
    return 

def detect_outliers(df,n,features): (using IQRs)

In [10]:
# Tukey's method using IQR to eliminate 
def detect_outliers(df,n,features):
    outlier_indices = []
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        # select observations containing more than 2 outliers
        outlier_indices = Counter(outlier_indices)        
        multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
        return multiple_outliers 
# Outliers_to_drop = detect_outliers(data,2,["col1","col2"])
# df.loc[Outliers_to_drop] # Show the outliers rows
# Drop outliers
# data= data.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

SCRUB:

Initial inspection of dataframe, datatypes, and null values

In [11]:
df.head()
Out[11]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15
0 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 1340 5650
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 1690 7639
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 2720 8062
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 1360 5000
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 1800 7503
In [12]:
# Display DataFrame Infro
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 17 columns):
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(6), int64(10), object(1)
memory usage: 2.8+ MB
In [13]:
# Check for columns with null values (remember strings/objects are not counted here)
res = df.isna().sum()
print(res[res>0],'\nTotal:',len(df))
# Waterfront, view, yr_renovated have missing values
waterfront      2376
view              63
yr_renovated    3842
dtype: int64 
Total: 21597

Fill in null values and recast variables for EDA

  • From examining the df.info and df.isna().sum(), there is 1 numerical data column that is currently text/object data type.
    • I will address these first since they would be excluded from preliminary visualizations.
  • There are also null values to address in waterfront, view, and yr_renovated
  • The Date column will be recast as datetime

zipcode

In [14]:
# Recast zipcode as a category
df['zipcode'] = df['zipcode'].astype('category')

sqft_basement

In [15]:
# RECASTING SQFT_BASEMENT
# Checking why sqft_basement might be an object:
# df['sqft_basement'].value_counts().nlargest(10)
check_column(df['sqft_basement'],10)
Column: df['sqft_basement']':
dtype: object
isna: 0 out of 21597 - 0.0%

Unique non-na values:
0.0       12826
?           454
600.0       217
500.0       209
700.0       208
800.0       201
400.0       184
1000.0      148
900.0       142
300.0       142
Name: sqft_basement, dtype: int64
  • sqft_basement is current an object, needs to be converted to int
    • Need to replace drop the 454 '?' values
    • There are a lot of 0's, for sqft_basement. Not sure if I should keep them in the dataset. I am for now.
    • Note: I originally replaced the ?'s with 0's, but am re-running the dataset with them dropped altogether.
In [16]:
# Removing rows with ? for sqft_basement and converting to float
print(len(df))
df['sqft_basement'].replace(to_replace = '?',value=np.nan,inplace=True) #,inplace=True)
df.dropna(axis=0,subset=['sqft_basement'],inplace=True)
df['sqft_basement'] = df['sqft_basement'].astype('float')
df['sqft_basement'].isna().sum()
check_column(df['sqft_basement'],10)
21597
Column: df['sqft_basement']':
dtype: float64
isna: 0 out of 21143 - 0.0%

Unique non-na values:
0.0       12826
600.0       217
500.0       209
700.0       208
800.0       201
400.0       184
1000.0      148
900.0       142
300.0       142
200.0       105
Name: sqft_basement, dtype: int64
In [17]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21143 entries, 0 to 21596
Data columns (total 17 columns):
price            21143 non-null float64
bedrooms         21143 non-null int64
bathrooms        21143 non-null float64
sqft_living      21143 non-null int64
sqft_lot         21143 non-null int64
floors           21143 non-null float64
waterfront       18804 non-null float64
view             21082 non-null float64
condition        21143 non-null int64
grade            21143 non-null int64
sqft_above       21143 non-null int64
sqft_basement    21143 non-null float64
yr_built         21143 non-null int64
yr_renovated     17389 non-null float64
zipcode          21143 non-null category
sqft_living15    21143 non-null int64
sqft_lot15       21143 non-null int64
dtypes: category(1), float64(7), int64(9)
memory usage: 2.8 MB

View

  • Replace the 61 null values with appropriate value for data type
  • Drop the 61 view null values
In [18]:
# Check for columns with null values (remember strings/objects are not counted here)
res = df.isna().sum()
print(res[res>0])
# Waterfront, view, yr_renovated have missing values
waterfront      2339
view              61
yr_renovated    3754
dtype: int64
In [19]:
check_column(df['view'])
Column: df['view']':
dtype: float64
isna: 61 out of 21143 - 0.289%

Unique non-na values:
0.0    19018
2.0      930
3.0      496
1.0      327
4.0      311
Name: view, dtype: int64
In [20]:
# Drop null values from view and re-check column
print(len(df))
df.dropna(axis=0,subset=['view'],inplace=True)
print(len(df))
check_column(df['view'])
21143
21082
Column: df['view']':
dtype: float64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0    19018
2.0      930
3.0      496
1.0      327
4.0      311
Name: view, dtype: int64
In [21]:
# Convert view to category and create a coded version called code_view
df['view'] = df['view'].astype('category')
df['code_view'] = df.view.cat.codes
df['code_view'].value_counts()
Out[21]:
0    19018
2      930
3      496
1      327
4      311
Name: code_view, dtype: int64
In [22]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'view'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)
In [23]:
#Recheck whole df
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price            21082 non-null float64
bedrooms         21082 non-null int64
bathrooms        21082 non-null float64
sqft_living      21082 non-null int64
sqft_lot         21082 non-null int64
floors           21082 non-null float64
waterfront       18749 non-null float64
condition        21082 non-null int64
grade            21082 non-null int64
sqft_above       21082 non-null int64
sqft_basement    21082 non-null float64
yr_built         21082 non-null int64
yr_renovated     17340 non-null float64
zipcode          21082 non-null category
sqft_living15    21082 non-null int64
sqft_lot15       21082 non-null int64
code_view        21082 non-null int8
dtypes: category(1), float64(6), int64(9), int8(1)
memory usage: 2.6 MB

Waterfront

  • Recast as int string
  • Waterfront has 2339 null values to replace. make into unordered and coded into "NaN" category
    • Note this is string "NaN" not np.nan
In [24]:
check_column(df['waterfront'])
Column: df['waterfront']':
dtype: float64
isna: 2333 out of 21082 - 11.066%

Unique non-na values:
0.0    18608
1.0      141
Name: waterfront, dtype: int64
In [25]:
# Convert waterfront to category, replace null values with "NaN"' string to make it a category
df['waterfront'] = df['waterfront'].astype('category')
df['waterfront'].replace(np.nan,"NaN",inplace=True)
df['waterfront'] = df['waterfront'].astype('category')
In [26]:
# df['waterfront'].value_counts()
check_column(df['waterfront'],10)
Column: df['waterfront']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0    18608
NaN     2333
1.0      141
Name: waterfront, dtype: int64
In [27]:
# Turn waterfront into coded column 
df['code_waterfront'] = df.waterfront.cat.codes
check_column(df['code_waterfront'])
Column: df['code_waterfront']':
dtype: int8
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0    18608
2     2333
1      141
Name: code_waterfront, dtype: int64
In [28]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'waterfront'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)
In [29]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price              21082 non-null float64
bedrooms           21082 non-null int64
bathrooms          21082 non-null float64
sqft_living        21082 non-null int64
sqft_lot           21082 non-null int64
floors             21082 non-null float64
condition          21082 non-null int64
grade              21082 non-null int64
sqft_above         21082 non-null int64
sqft_basement      21082 non-null float64
yr_built           21082 non-null int64
yr_renovated       17340 non-null float64
zipcode            21082 non-null category
sqft_living15      21082 non-null int64
sqft_lot15         21082 non-null int64
code_view          21082 non-null int8
code_waterfront    21082 non-null int8
dtypes: category(1), float64(5), int64(9), int8(2)
memory usage: 2.5 MB

yr_renovated --> is_reno

  • yr_renovated has 3754 null values
    • Presumably this is because the homes were never renovated
    • Will replace with 0.0
    • Replace all 0.0's with NaN
  • Replacing yr_renovated with binary category based on it if has been renovated or not
    • Make new column 'is_reno'
Make is_reno category (0 or 1)
In [30]:
check_column(df['yr_renovated'],10)
Column: df['yr_renovated']':
dtype: float64
isna: 3742 out of 21082 - 17.75%

Unique non-na values:
0.0       16618
2014.0       69
2003.0       31
2013.0       31
2007.0       30
2000.0       29
2005.0       29
1990.0       22
2009.0       21
2004.0       21
Name: yr_renovated, dtype: int64
In [31]:
# Fill in 0.0 yr_renovated as np.nan temporarily
df['yr_renovated'].replace(0.0,np.nan,inplace=True)

# Recheck for null values
check_column(df['yr_renovated'],10) #df.isna().sum()
Column: df['yr_renovated']':
dtype: float64
isna: 20360 out of 21082 - 96.575%

Unique non-na values:
2014.0    69
2013.0    31
2003.0    31
2007.0    30
2000.0    29
2005.0    29
1990.0    22
2009.0    21
2004.0    21
2006.0    20
Name: yr_renovated, dtype: int64
In [32]:
# Get indices of all np.nan values to be used for is_reno category below
idx_nan = df['yr_renovated'].loc[df['yr_renovated'].isna()].index
idx_notnan = df['yr_renovated'].loc[~df['yr_renovated'].isna()].index

# Make df['is_reno']
df['is_reno'] = df['yr_renovated'].copy
df['is_reno'][idx_nan] = 0
df['is_reno'][idx_notnan] = 1

check_column(df['is_reno'],10) 
Column: df['is_reno']':
dtype: int64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0    20360
1      722
Name: is_reno, dtype: int64
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [33]:
# Drop drop_me variable from main df, save in df_dropped
drop_me = 'yr_renovated'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

SCRUBBING THUS FAR...

  • Removed null values by dropping na from sqft_basement and view.
  • Converted waterfront to category and made NaN its own separate category (since there were so many null vlaues)
  • Converted yr_renovated to is_reno simple 0 or 1 value
  • Recase zipcodes as category since there is no numerical relationship between zipcode values
  • Next to inspect distributions and scatterplots to identify which numerical columns may be better analyzed as categories
In [34]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price              21082 non-null float64
bedrooms           21082 non-null int64
bathrooms          21082 non-null float64
sqft_living        21082 non-null int64
sqft_lot           21082 non-null int64
floors             21082 non-null float64
condition          21082 non-null int64
grade              21082 non-null int64
sqft_above         21082 non-null int64
sqft_basement      21082 non-null float64
yr_built           21082 non-null int64
zipcode            21082 non-null category
sqft_living15      21082 non-null int64
sqft_lot15         21082 non-null int64
code_view          21082 non-null int8
code_waterfront    21082 non-null int8
is_reno            21082 non-null int64
dtypes: category(1), float64(4), int64(10), int8(2)
memory usage: 3.1 MB

Question 1: Which predictors should be analyzed as continuous data, vs binned/categorical data?

Identifying numerical data better analyzed as categorical

  • Will examine histogram distributions and scatter plots vs price for each variable in df.describe()
In [35]:
# Plot histograms and scatterplots vs target variable price for all numerical columns in df (show up in .describe())

fig = plot_hist_scat(df)
# fig = plot_hist_scat_sns(df)

Answer 1: How to treat each variable

  • Notes on histograms and scatterplots #### First, to comment on scatter plots that are indicative of categorical data:
  • Columns to be analyzed as categorical data: (cast as int for now, make sure its .astype('category').cat.as_ordered() one-hot coded later before running regression
    • Waterfront
    • View
    • Zipcode
  • Columns that are ordinal data. (e.g. 2 floors is indeed double 1 floor)
    • Floors
    • Bedrooms
    • Bathrooms
    • Condition
    • Grade
  • Columns that may be best analyzed by binning and casting as categorical data
    • Yr_built
    • is_reno
      • classified as simply renovated or not.**
    • sqft_basement
    • sqft_above
  • Numerical columns (that may be best analyzed as such)
    • All sqft categories
    • price
    • Note: moved sqft_basement to binned category to deal with 0's, also added sqft_above to accompany it
  • Numerical columns that were dropped
    • id
    • Lat
    • Long
    • Date
    • yr_renovated -> is_reno

Second, to comment on distributions

  • The target variable, price, seems a bit skewed and may be better analyzed as log-transformed.
    • Try both log-transformed and unaltered
  • All sqft columns seem to be skewed and should be transformed. (log)

List of vartypes/names

# Create tuples with columns names of categorical variables for easy extraction 
cat_vars = ('waterfront','view','zipcode')
ord_vars = ('grade','condition','floors','bedrooms','bathrooms')
vars_to_bin = ('yr_built','yr_renovated','sqft_above','sqft_basement')
num_vars = ('sqft_living', 'sqft_lot','sqft_living15', 'sqft_lot15')

COARSE-BINNING Numerical Data

  • yr_built, yr_renovated
  • Added sqft_basement due to 0 values
  • Added sqft_above to accompany basement

binning yr_built

In [36]:
# Check the min and max of the yr variables for binning range
df['yr_built'].describe()
Out[36]:
count    21082.000000
mean      1971.023337
std         29.323120
min       1900.000000
25%       1952.000000
50%       1975.000000
75%       1997.000000
max       2015.000000
Name: yr_built, dtype: float64
In [37]:
## Bin yr_built then make yr_built category
# yr_built min is 1900, max is 2015
bins = list(range(1900,2030,10))

df['yr_built'].replace(np.nan,0,inplace=True)
bins_yrbuilt = pd.cut(df['yr_built'], bins,include_lowest=True) # Cut into bins
# check_column(bins_yrbuilt)

df['bins_yrbuilt'] = bins_yrbuilt.astype('category').cat.as_ordered() #.cat.as_ordered()

# Inspecting the binned data counts
check_column(df['bins_yrbuilt'])
Column: df['bins_yrbuilt']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(2000.0, 2010.0]      3354
(1960.0, 1970.0]      2484
(1950.0, 1960.0]      2411
(1970.0, 1980.0]      2351
(1980.0, 1990.0]      2297
(1990.0, 2000.0]      2085
(1940.0, 1950.0]      1806
(1920.0, 1930.0]      1142
(2010.0, 2020.0]      1066
(1899.999, 1910.0]     759
(1910.0, 1920.0]       747
(1930.0, 1940.0]       580
Name: bins_yrbuilt, dtype: int64
In [38]:
# Drop original 
drop_me = 'yr_built'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

binning sqft_basement

In [39]:
df['sqft_basement'].describe()
Out[39]:
count    21082.000000
mean       291.359975
std        442.007858
min          0.000000
25%          0.000000
50%          0.000000
75%        560.000000
max       4820.000000
Name: sqft_basement, dtype: float64
In [40]:
check_column(df['sqft_basement'],10)
Column: df['sqft_basement']':
dtype: float64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
0.0       12798
600.0       216
500.0       209
700.0       207
800.0       201
400.0       184
1000.0      147
900.0       142
300.0       140
200.0       105
Name: sqft_basement, dtype: int64
In [41]:
# Definine bins where bins 0-1 is its own interval
bins=[-np.inf]
[bins.append(x) for x in range(1,5500,500)]
# bins

# cut_basement = df['sqft_basement'].replace(0,np.nan)
cut_basement = df['sqft_basement'].replace(np.nan,0).copy()
# cut_basement = cut_basement.replace('NaN',0)

bins_sqftbase = pd.cut(cut_basement, bins=bins, include_lowest=True) # Cut into bins
df['bins_sqftbasement'] = bins_sqftbase.copy()

# Cast as ordered category
df['bins_sqftbasement'] = df['bins_sqftbasement'].astype('category').cat.as_ordered()

# Check result
check_column(df['bins_sqftbasement'])
Column: df['bins_sqftbasement']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(-inf, 1.0]         12798
(501.0, 1001.0]      3849
(1.0, 501.0]         2617
(1001.0, 1501.0]     1476
(1501.0, 2001.0]      269
(2001.0, 2501.0]       57
(2501.0, 3001.0]       11
(3001.0, 3501.0]        3
(4501.0, 5001.0]        1
(4001.0, 4501.0]        1
(3501.0, 4001.0]        0
Name: bins_sqftbasement, dtype: int64
In [42]:
# Drop original 
drop_me = 'sqft_basement'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

binning sqft_above

In [43]:
check_column(df['sqft_above'],10)
df['sqft_above'].describe()
Column: df['sqft_above']':
dtype: int64
isna: 0 out of 21082 - 0.0%

Unique non-na values:
1300    208
1010    206
1200    197
1220    190
1140    181
1180    174
1400    174
1340    173
1060    173
1250    168
Name: sqft_above, dtype: int64
Out[43]:
count    21082.000000
mean      1789.000000
std        828.369355
min        370.000000
25%       1200.000000
50%       1560.000000
75%       2210.000000
max       9410.000000
Name: sqft_above, dtype: float64
In [44]:
# sqft_above 
# Bins to cover range seen above in .describe
bins = list(range(0,9501,500))

# cut_above = df['sqft_above'].replace(0,np.nan)
bins_sqftabove = pd.cut(df['sqft_above'], bins=bins, include_lowest=True) # Cut into bins, including left edge 
check_column(bins_sqftabove)
Column: df['sqft_above']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(1000.0, 1500.0]    7352
(1500.0, 2000.0]    4686
(2000.0, 2500.0]    2764
(500.0, 1000.0]     2488
(2500.0, 3000.0]    1815
(3000.0, 3500.0]    1088
(3500.0, 4000.0]     466
(4000.0, 4500.0]     239
(4500.0, 5000.0]      80
(5000.0, 5500.0]      42
(-0.001, 500.0]       19
(6000.0, 6500.0]      16
(5500.0, 6000.0]      14
(6500.0, 7000.0]       4
(7500.0, 8000.0]       3
(8500.0, 9000.0]       2
(7000.0, 7500.0]       2
(8000.0, 8500.0]       1
(9000.0, 9500.0]       1
Name: sqft_above, dtype: int64
In [45]:
bins_sqftabove.replace(np.nan,'NaN',inplace=True)
df['bins_sqftabove']=bins_sqftabove.astype('category').cat.as_ordered()

check_column(df['bins_sqftabove'])
Column: df['bins_sqftabove']':
dtype: category
isna: 0 out of 21082 - 0.0%

Unique non-na values:
(1000.0, 1500.0]    7352
(1500.0, 2000.0]    4686
(2000.0, 2500.0]    2764
(500.0, 1000.0]     2488
(2500.0, 3000.0]    1815
(3000.0, 3500.0]    1088
(3500.0, 4000.0]     466
(4000.0, 4500.0]     239
(4500.0, 5000.0]      80
(5000.0, 5500.0]      42
(-0.001, 500.0]       19
(6000.0, 6500.0]      16
(5500.0, 6000.0]      14
(6500.0, 7000.0]       4
(7500.0, 8000.0]       3
(8500.0, 9000.0]       2
(7000.0, 7500.0]       2
(8000.0, 8500.0]       1
(9000.0, 9500.0]       1
Name: bins_sqftabove, dtype: int64
In [46]:
# Drop original 
drop_me = 'sqft_above'
df_dropped[drop_me] = df[drop_me].copy()
df.drop(drop_me,axis=1,inplace=True)

CHECKING FOR MULTICOLLINEARITY

In [47]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [48]:
# to_drop_copy = ['sqft_above','sqft_basement','yr_built','waterfront','view']
# df_dropped = df[to_drop_copy].copy()
# # df.drop(to_drop_copy,axis=1,inplace=True)
In [49]:
# Plot correlation  heatmaps for all data 
# pause
to_drop = ['price']
multiplot(df.drop(to_drop,axis=1))
In [50]:
# # Test which multicol variables to keep
# pause
# df[['price','sqft_living','sqft_living15','sqft_above','sqft_basement']].corr()

Answer 2:

  • Sqft_living is highly correlated with sqft_living15 and grade,
  • These correlations make sense sense since neighborhoods probably have similar construction.
    • The r values are ~0.75 (threshold) and have sufficient intuitive rationale to keep.
In [51]:
# # Save copy of data in current form.
# df.to_csv(data_filepath+ 'scrubbed_data.csv')
# df_dropped.to_csv(data_filepath+'dropped_data.csv')

EXPLORE:

  • Re-examining the basic stats and distributions of the data.
  • Decide on transformations to perform
    • Normalize afterwards.
  • Visually examine for possible relationships

EDA before normalization/transformation

Examine basic descriptive stats

In [52]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [53]:
df.head()
Out[53]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 sqft_lot15 code_view code_waterfront is_reno bins_yrbuilt bins_sqftbasement bins_sqftabove
0 221900.0 3 1.00 1180 5650 1.0 3 7 98178 1340 5650 0 2 0 (1950.0, 1960.0] (-inf, 1.0] (1000.0, 1500.0]
1 538000.0 3 2.25 2570 7242 2.0 3 7 98125 1690 7639 0 0 1 (1950.0, 1960.0] (1.0, 501.0] (2000.0, 2500.0]
2 180000.0 2 1.00 770 10000 1.0 3 6 98028 2720 8062 0 0 0 (1930.0, 1940.0] (-inf, 1.0] (500.0, 1000.0]
3 604000.0 4 3.00 1960 5000 1.0 5 7 98136 1360 5000 0 0 0 (1960.0, 1970.0] (501.0, 1001.0] (1000.0, 1500.0]
4 510000.0 3 2.00 1680 8080 1.0 3 8 98074 1800 7503 0 0 0 (1980.0, 1990.0] (-inf, 1.0] (1500.0, 2000.0]
In [54]:
df.describe()
Out[54]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000

Notes on basic statistics

  • Bedrooms has some very clear outliers (max is 33, but 75% quartile is only 4)
    • May want to remove outliers after Z-scoring (Absolute Z-score > 3)
  • Same with bathrooms (8 is max, 75% quartile is only 2.5)
  • Same with sqft_living (max 13540, 75% quartile = 2550)
  • Also same with sqft_lot15, sqftliving15

Visualizing numerical data

  • Distributions and scatterplots
  • Note: May want to cast all categoricals as strings/categories for visualization
    • Keeping as is for now
In [55]:
# Separate out numerical 
df.describe()
Out[55]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000
In [56]:
# INSPECTING NUMERICAL DATA DISTPLOTS + KDE
plt.style.use('seaborn')
plot_hist_scat_sns(df)
# for column in df.describe().columns:
    
#     sns.distplot(df[column],kde=True,label = column+' histogram',norm_hist=True)
#     title = column+' histogram & KDE'
#     plt.title(title.title())
#     plt.ylabel('Density')
#     plt.legend()
#     plt.show()

# Uncomment below if want new scatter_matrix
# fig = pd.plotting.scatter_matrix(df_clean,figsize=(12,12))

Notes on distplots

  • Essentialy all numerical data seems to be at least slightly skewed.
    • Do not think it is sufficient to log-transform the data and lose model interpretability

Visualizing categorical data

In [57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [58]:
# Define list of all categorical variables 
list_cat_vars = ['zipcode', 'bins_yrbuilt', 'bins_sqftbasement', 'bins_sqftabove','condition','grade','code_view','code_waterfront']
In [59]:
# Plotting scatterplots and boxplots for categorical data
plt.style.use('seaborn')
for column in list_cat_vars:
   
    fig = plt.figure(figsize=(12,4))

    ax1  = fig.add_subplot(121)
    ax1 = sns.stripplot(x=df[column],y=df['price'],marker='.') 
    # ax1.set_title('Price vs ',column,' scatter plot')
    title1=column+' scatter'
    ax1.set_title(title1.title())
    ax1.set_xlabel(column)
    ax1.set_ylabel('Price')

    
    ax2 = fig.add_subplot(122)
    ax2 = sns.boxplot(x=df[column],y=df['price']) 
    title2=column+' boxplot'
    ax2.set_title(title2.title())
    ax2.set_xlabel(column)
    ax2.set_ylabel('Price')
    fig.tight_layout()

Question/Answer 3A: Which categorical variables show the greatest potential as predictors?

ANSWER 3A:

Notes on categorical scatter plots

  • grade seems to be strongly related to price (notice how the whole range of values seems to increase in price at higher grades.
  • Zipcodes look to differ quite a bit in terms of price.
  • Condition looks to be highly related to price (at least above condition level 2)
  • View does not look as strongly related to price as I expected.
  • Floors do not seem as related as expected
  • yr_built may have some relationship with price
  • yr_renovated has somewhat of a trend, but recent renovations buck this trend
In [60]:
# # INSPECTING REGRESSION PLOTS
# plt.style.use('seaborn')

# plot_vars=df.describe().columns

# for column in plot_vars:
# #     df_plot=df[column]
# #     df_plot = df.loc[df[column]>0]
#     plot= sns.regplot(df[column], df['price'],robust=False,marker='.') #kde=True,label = column+' histogram')
# #     plot = sns.boxplot(df[column],df['price'])
#     title = column+' linear regression'
#     plt.title(title.title())
# #     plt.legend()
#     plt.show()
In [61]:
df.describe()
Out[61]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 code_view code_waterfront is_reno
count 2.108200e+04 21082.000000 21082.000000 21082.000000 2.108200e+04 21082.00000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000 21082.000000
mean 5.402469e+05 3.372403 2.115916 2080.359975 1.507759e+04 1.49362 3.409828 7.657717 1986.917418 12732.514135 0.233327 0.228014 0.034247
std 3.667323e+05 0.924996 0.768142 917.856396 4.117338e+04 0.53937 0.650597 1.173690 685.544250 27148.781580 0.765066 0.630372 0.181868
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.00000 1.000000 3.000000 399.000000 651.000000 0.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.00000 3.000000 7.000000 1490.000000 5100.000000 0.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.620000e+03 1.50000 3.000000 7.000000 1840.000000 7626.000000 0.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.069775e+04 2.00000 4.000000 8.000000 2360.000000 10088.750000 0.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.50000 5.000000 13.000000 6210.000000 871200.000000 4.000000 2.000000 1.000000

[SCRUB-2] NORMALIZING & TRANSFORMING

Graphing raw vs normalized results to decide if dataset should be normalized

  • log-transform and z-score numerical data.
  • Am using detect_outlier function to apply Tukey's method for outlier remnoval based on IQR

Outlier Removal - visualizing

  • Will be using pre-defined function detect_outliers(df,n,var_name)
    • Returns index of rows containing outliers based in IQR
In [62]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21082 entries, 0 to 21596
Data columns (total 17 columns):
price                21082 non-null float64
bedrooms             21082 non-null int64
bathrooms            21082 non-null float64
sqft_living          21082 non-null int64
sqft_lot             21082 non-null int64
floors               21082 non-null float64
condition            21082 non-null int64
grade                21082 non-null int64
zipcode              21082 non-null category
sqft_living15        21082 non-null int64
sqft_lot15           21082 non-null int64
code_view            21082 non-null int8
code_waterfront      21082 non-null int8
is_reno              21082 non-null int64
bins_yrbuilt         21082 non-null category
bins_sqftbasement    21082 non-null category
bins_sqftabove       21082 non-null category
dtypes: category(4), float64(3), int64(8), int8(2)
memory usage: 2.7 MB
In [63]:
# Define variables to plot vs tukey-cleaned outliers 
vars_to_norm = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']
df.describe().columns
Out[63]:
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'condition', 'grade', 'sqft_living15', 'sqft_lot15', 'code_view',
       'code_waterfront', 'is_reno'],
      dtype='object')
In [64]:
from collections import Counter
from sklearn.preprocessing import RobustScaler
robscaler = RobustScaler()

# vars_to_norm = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']
norm_results = [['column','K_square','p-val']]

# Graph all potential normalizedvariables
for var_name in df.describe():

    var = df[var_name]
    fig = plt.figure(figsize=(12,4))
    ax1 = fig.add_subplot(121)
    ax1 = sns.distplot(var, norm_hist=True) 
    ax1.set_title('Raw '+var_name)
    #robscaler.fit_transform(np.array(var).reshape(-1,1)
    
    # OUTLIER REMOVAL
    outliers_to_drop = detect_outliers(df,0,[var_name])
    var_clean =df[var_name].drop(outliers_to_drop)
    
    ax2 = fig.add_subplot(122)
    
    ax2 = sns.distplot(var_clean,norm_hist=True)
#     ax2 = sns.distplot(robscaler.fit_transform(np.array(var_clean).reshape(-1,1)),norm_hist=True)
    
    ax2.set_title('Tukey Outliers Removed '+var_name) #+var)
    ax2.set_xlabel('Scale')
    stat, p = normtest(var_clean)
#     norm_results.append([var_clean,stat, p])
# norm_results = pd.DataFrame(norm_results[2:],columns=norm_results[0])#,index='columns')
# norm_results
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\nonparametric\kde.py:488: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\numpy\core\fromnumeric.py:83: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Question/Answer 3B: Does removal of outliers improve the distrubtions?

  • The data is skewed by outliers.
    • Comparing it to IQR-method outlier elimination reveals much improved results
    • The distributions look much more reasonabile with outliers removed.

REMOVING OUTLIERS

Recall stats observations from beginning of EDA:

Notes on basic statistics

  • Bedrooms has some very clear outliers (max is 33, but 75% quartile is only 4)
    • May want to remove outliers after Z-scoring (Absolute Z-score > 3)
  • Same with bathrooms (8 is max, 75% quartile is only 2.5)
  • Same with sqft_living (max 13540, 75% quartile = 2550)
  • Also same with sqft_lot15, sqftliving15
In [65]:
# DEFINE VARIABLES TO GET THE OUTLIERS FOR (based on observations)
# vars_to_norm = ['price','bedrooms''sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms'
vars_to_clean = ['price','bedrooms','sqft_living','bathrooms','sqft_living15']
In [66]:
# GET OUTLIER INDICES AND REPORT 

outliers_to_drop = {}

for col in vars_to_clean:
    outliers_to_drop[col] = detect_outliers(df,0,[col])
# outliers_to_drop.keys()
# outliers_to_drop.values()

# Print out # of outliers
for k, v in outliers_to_drop.items():
    print(f'col: {k} has {len(v)} outliers. ({round(len(v)/len(df),2)*100}%)' )
col: price has 1131 outliers. (5.0%)
col: bedrooms has 511 outliers. (2.0%)
col: sqft_living has 557 outliers. (3.0%)
col: bathrooms has 546 outliers. (3.0%)
col: sqft_living15 has 532 outliers. (3.0%)

Filling in df_norm

In [67]:
# Intialize df_norm with df's values
df_norm=df.copy()
In [68]:
# Iterate throught outliers_to_drop dictionary to replace outliers with np.nan
for k, v in outliers_to_drop.items():
    df_norm.loc[v,k] = np.nan # axis=0,inplace=True)

# Display null values
df_norm.isna().sum()
Out[68]:
price                1131
bedrooms              511
bathrooms             546
sqft_living           557
sqft_lot                0
floors                  0
condition               0
grade                   0
zipcode                 0
sqft_living15         532
sqft_lot15              0
code_view               0
code_waterfront         0
is_reno                 0
bins_yrbuilt            0
bins_sqftbasement       0
bins_sqftabove          0
dtype: int64
In [69]:
# REMOVING OUTLIERS FROM PRICE
drop_col = 'price'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=['price'],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for price:
# of intial rows: 21082
# after dropping rows: 19951

Outliers remaining: 
price                  0
bedrooms             449
bathrooms            253
sqft_living          169
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        231
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [70]:
# REMOVING OUTLIERS FROM BEDROOMS
drop_col = 'bedrooms'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for bedrooms:
# of intial rows: 19951
# after dropping rows: 19502

Outliers remaining: 
price                  0
bedrooms               0
bathrooms            196
sqft_living          150
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        228
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [71]:
# REMOVING OUTLIERS FROM PRICE

drop_col = 'sqft_living'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for sqft_living:
# of intial rows: 19502
# after dropping rows: 19352

Outliers remaining: 
price                  0
bedrooms               0
bathrooms            147
sqft_living            0
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        176
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [72]:
# REMOVING OUTLIERS FROM BATHROOMS
drop_col = 'bathrooms'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for bathrooms:
# of intial rows: 19352
# after dropping rows: 19205

Outliers remaining: 
price                  0
bedrooms               0
bathrooms              0
sqft_living            0
sqft_lot               0
floors                 0
condition              0
grade                  0
zipcode                0
sqft_living15        152
sqft_lot15             0
code_view              0
code_waterfront        0
is_reno                0
bins_yrbuilt           0
bins_sqftbasement      0
bins_sqftabove         0
dtype: int64
In [73]:
drop_col = 'sqft_living15'

print(f'for {drop_col}:')
print(f'# of intial rows: {len(df_norm)}')
df_norm.dropna(subset=[drop_col],inplace=True)
print(f'# after dropping rows: {len(df_norm)}')
print(f'\nOutliers remaining: \n{df_norm.isna().sum()}')
for sqft_living15:
# of intial rows: 19205
# after dropping rows: 19053

Outliers remaining: 
price                0
bedrooms             0
bathrooms            0
sqft_living          0
sqft_lot             0
floors               0
condition            0
grade                0
zipcode              0
sqft_living15        0
sqft_lot15           0
code_view            0
code_waterfront      0
is_reno              0
bins_yrbuilt         0
bins_sqftbasement    0
bins_sqftabove       0
dtype: int64

NORMALIZING UNITS (RobustScaler)

In [74]:
# ADDING OUTLIER REMOVAL FROM preprocessing.RobuseScaler
from sklearn.preprocessing import RobustScaler

robscaler = RobustScaler()
robscaler
Out[74]:
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)
In [75]:
vars_to_scale = ['sqft_living','sqft_lot','sqft_living15','sqft_lot15','bedrooms','bathrooms']

for col in vars_to_scale:
            
    col_data = np.array(np.array(df_norm[col]))
    res = robscaler.fit_transform(col_data.reshape(-1,1)) #,df['price'])
    df_norm['sca_'+col] = res.flatten()
In [76]:
# IF DROPPING VARS UNCOMMENT BELOW
# df_norm.drop(vars_to_scale,axis=1,inplace=True)
In [77]:
df_norm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB
In [78]:
df_norm.head()
Out[78]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 ... is_reno bins_yrbuilt bins_sqftbasement bins_sqftabove sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms
0 221900.0 3.0 1.00 1180.0 5650 1.0 3 7 98178 1340.0 ... 0 (1950.0, 1960.0] (-inf, 1.0] (1000.0, 1500.0] -0.663265 -0.355075 -0.569620 -0.391872 0.0 -1.00
1 538000.0 3.0 2.25 2570.0 7242 2.0 3 7 98125 1690.0 ... 1 (1950.0, 1960.0] (1.0, 501.0] (2000.0, 2500.0] 0.755102 -0.047858 -0.126582 0.026953 0.0 0.25
2 180000.0 2.0 1.00 770.0 10000 1.0 3 6 98028 2720.0 ... 0 (1930.0, 1940.0] (-inf, 1.0] (500.0, 1000.0] -1.081633 0.484369 1.177215 0.116024 -1.0 -1.00
3 604000.0 4.0 3.00 1960.0 5000 1.0 5 7 98136 1360.0 ... 0 (1960.0, 1970.0] (501.0, 1001.0] (1000.0, 1500.0] 0.132653 -0.480509 -0.544304 -0.528743 1.0 1.00
4 510000.0 3.0 2.00 1680.0 8080 1.0 3 8 98074 1800.0 ... 0 (1980.0, 1990.0] (-inf, 1.0] (1500.0, 2000.0] -0.153061 0.113856 0.012658 -0.001685 0.0 0.00

5 rows × 23 columns

Normalizing by log+z-score (OLD)

In [79]:
# # Normalize chosen vars
# # df_norm = pd.DataFrame({})
# df_norm['logz_price'] = log_z(df['price'])
# df_norm['logz_sqft_living'] = log_z(df['sqft_living'])a
# df_norm['logz_sqft_living15'] = log_z(df['sqft_living15'])
# df_norm['logz_sqft_lot'] = log_z(df['sqft_lot'])
# df_norm['logz_sqft_lot15'] = log_z(df['sqft_lot15'])
# df_norm.index = df.index

Normalizing to z-score + removing >3 STD (OLD)

In [80]:
# Z-SCORE AND REMOVE >3 STD 
# vars_dont_norm = ['yr_renovated','yr_built','floors','view','grade','condition']
# vars_to_norm = df.describe().drop(vars_dont_norm,axis=1).columns
# vars_to_rem_out = None

# col_names =vars_to_rem_out
# df_norm = df[vars_to_norm].copy()
# df_norm['id'] = df['id'].copy()
# df_norm.set_index('id')
# df_norm.info()
                  
# for col in col_names:
#       if col != 'id':
#         df_norm['zClean_'+col] = rem_out_z(df[col])
#     # df_norm  = df_norm.filter(regex=('log'),axis=1)
# df_norm.info()
In [81]:
# Drop na
# # df_run = pd.DataFrame({)
# temp_df = df.filter(regex=('bin'),axis=1)
# df_run = pd.concat([df_norm,temp_df],axis=1)#,sort=False)
# df_run.describe()
In [82]:
# # LOG TRANSFORM AND Z-SCORE DATA
# col_names = vars_to_norm
# df_norm = df[vars_to_norm].copy()
# # df_norm.info()
# for col in col_names:
#     data_to_log = df.loc[df[col]>0]
#     df_norm['logZ_'+col] = log_z(data_to_log[col])
# #     df_norm['logZ_'+col] = log_z(df[col])
# df_norm  = df_norm.filter(regex=('log'),axis=1)
# df_norm.info()

CHECKING NORMALIZED DATASET

In [83]:
plt.style.use('fivethirtyeight')
plot_hist_scat_sns(df_norm)
In [84]:
df_norm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB

Recheck multipol

In [85]:
multiplot(df_norm.filter(regex='sca',axis=1))
plt.title('Scaled Data only')
Out[85]:
Text(0.5, 1.0, 'Scaled Data only')
In [86]:
X =df_norm.loc[:,~(df_norm.columns.str.startswith('sca'))]
multiplot(X.drop('price',axis=1))
plt.title('Un-scaled Data')
X=[]

No multicollinearity to worry about. Huzzah!

In [87]:
# DEFINING DATASET TO RUN (df_run)
df_run = df_norm.copy()
# df_run.to_csv(data_filepath+'df_run_pre_codes.csv')
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 23 columns):
price                19053 non-null float64
bedrooms             19053 non-null float64
bathrooms            19053 non-null float64
sqft_living          19053 non-null float64
sqft_lot             19053 non-null int64
floors               19053 non-null float64
condition            19053 non-null int64
grade                19053 non-null int64
zipcode              19053 non-null category
sqft_living15        19053 non-null float64
sqft_lot15           19053 non-null int64
code_view            19053 non-null int8
code_waterfront      19053 non-null int8
is_reno              19053 non-null int64
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
sca_sqft_living      19053 non-null float64
sca_sqft_lot         19053 non-null float64
sca_sqft_living15    19053 non-null float64
sca_sqft_lot15       19053 non-null float64
sca_bedrooms         19053 non-null float64
sca_bathrooms        19053 non-null float64
dtypes: category(4), float64(12), int64(5), int8(2)
memory usage: 2.7 MB

CAT.CODES FOR BINNED DATA

In [88]:
df_filt = pd.DataFrame({})

df_filt = df_run.filter(regex=('bins_'),axis =1).copy()
df_filt['zipcode']=df_run['zipcode'].copy()
df_filt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 4 columns):
bins_yrbuilt         19053 non-null category
bins_sqftbasement    19053 non-null category
bins_sqftabove       19053 non-null category
zipcode              19053 non-null category
dtypes: category(4)
memory usage: 227.0 KB
In [89]:
# Creating binned vars cat codes
for col in df_filt:
    df_filt['code'+col] = df_filt[col].cat.codes
    df_filt.drop(col,axis=1,inplace=True)
df_filt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 4 columns):
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: int8(4)
memory usage: 223.3 KB
In [90]:
df_filt.head()
Out[90]:
codebins_yrbuilt codebins_sqftbasement codebins_sqftabove codezipcode
0 5 0 2 66
1 5 1 4 55
2 3 0 1 16
3 6 2 2 58
4 8 0 3 37

Concatenate final df for modeling (df_run)

In [91]:
## Select columns that do not contain the string 'logZ'
# df_run =df_run.loc[:,~(df_run.columns.str.startswith('logZ'))]
In [92]:
# # df_filt['price'] = df_run['price'].copy()
# df_run =df_run.loc[:,~(df_run.columns.str.startswith('bin'))]
# df_run.info()
In [93]:
# # DECIDE NOW IF WANT CLEAN OR NOT CLEANED DATA

# to_drop_unclean = ['bedrooms','bathrooms','sqft_living','sqft_lot','sqft_living15','sqft_lot15','zClean_price']
# to_drop_clean = list(df_run.loc[:,~(df_run.columns.str.startswith('zClean'))].columns)
# to_drop_unclean, to_drop_clean
In [94]:
# DROP UNCLEAN
# df_run.drop(['id','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_living15','sqft_lot15','zClean_price'],axis=1,inplace=True)
# df_run.drop(to_drop_unclean,axis=1,inplace=True)
In [95]:
# Concatenate codebins from df_filt + df_run
df_run = pd.concat([df_run, df_filt],axis=1)  #).filter(regex=('code'))],axis=1)
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [96]:
df_run.head(2)
Out[96]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade zipcode sqft_living15 ... sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms codebins_yrbuilt codebins_sqftbasement codebins_sqftabove codezipcode
0 221900.0 3.0 1.00 1180.0 5650 1.0 3 7 98178 1340.0 ... -0.663265 -0.355075 -0.569620 -0.391872 0.0 -1.00 5 0 2 66
1 538000.0 3.0 2.25 2570.0 7242 2.0 3 7 98125 1690.0 ... 0.755102 -0.047858 -0.126582 0.026953 0.0 0.25 5 1 4 55

2 rows × 27 columns

Clever line of code to select columns by name

# Select columns that do not contain the string 'logZ'
df_run =df_run.loc[:,~(df_run.columns.str.startswith('logZ'))]

Saving/loading df_run after cleaning up

In [97]:
# bookmark 202am
# df_run.to_csv(data_filepath+'df_run_coded.csv')
In [98]:
# df_run = pd.read_csv('df_run_coded.csv')
# df_run.info()

FITTING AN INTIAL MODEL:

In [99]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB

DETERMINING IDEAL FEATURES TO USE

  • Use MinMaxScaler to get on same scale
  • Use RFE to find the best features
  • Get ranking of feature importance (from both scaled and unscaled data)
In [100]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

import sklearn.metrics as metrics
In [101]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [102]:
# Define selector function combining RFE and linear regression
linreg = LinearRegression()
selector = RFE(linreg, n_features_to_select=1)

# Drop already scaled variables for this feature testing
X =df_run.loc[:,~(df_run.columns.str.startswith(('bins','zip')))]
X = X.drop('price',axis=1)

# RUNNING RFE ON THE UNSCALED DATA(DEMONSTRATION)
Y = df_run['price']
# Y = df_run['logz_price']
# X = df_run.drop(['price'],axis=1)
In [103]:
# Checking X
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 22 columns):
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: float64(11), int64(5), int8(6)
memory usage: 2.6 MB
In [104]:
# Run regressions on X,Y 
selector = selector.fit(X,Y)

# Saving unscaled rankings for demo purposes
no_scale = selector.ranking_
In [105]:
# Scale all variables to value between 0-1 to use RFE to determine which features are the most important for determining price?
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Scale the data before running RFE
print('Consider revisiting this step and dummy-coding zipcode.')

# ONLY SCALE NON-CATEGORICAL, ONE-HOT CATEGORICAL
scaler.fit(X,Y)
scaled_data = scaler.transform(X)
scaled_data.shape
Consider revisiting this step and dummy-coding zipcode.
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\sklearn\preprocessing\data.py:323: DataConversionWarning: Data with input dtype int8, int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
Out[105]:
(19053, 22)
In [106]:
# Running RFE with scaled data
selector = selector.fit(scaled_data, Y) 
scaled = selector.ranking_
In [107]:
# Create a dataframe with the ranked values of each feature for both scaled and unscaled data
best_features = pd.DataFrame({'columns':X.columns, 'scaled_rank' : scaled,'unscaled_rank':no_scale})
best_features.set_index('columns',inplace=True)
In [108]:
# Display dataframe (sorted based on unscaled rank)
best_features.sort_values('unscaled_rank')
Out[108]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
sca_bathrooms 18 7
is_reno 20 8
condition 13 9
sca_bedrooms 19 10
bathrooms 15 11
bedrooms 16 12
code_waterfront 22 13
codezipcode 17 14
sqft_living 2 15
sqft_living15 12 16
sqft_lot15 14 17
sqft_lot 5 18
sca_sqft_living15 7 19
sca_sqft_living 4 20
sca_sqft_lot15 6 21
sca_sqft_lot 10 22
In [109]:
# Plot the difference in feature importance between analyzing scaled and unscaled data. 
# For demonstration purposes.
features = pd.DataFrame({'Columns':X.columns, 'Not_Scaled':no_scale, 'Scaled':scaled})
# PLot the difference between 
plt.style.use('seaborn')
# plt.xkcd()
ax = features.set_index('Columns').plot(kind = 'bar',figsize=(12,8))
ax.set_title('Feature Importance Scaled vs. Not Scaled')
ax.set_ylabel('Features Importance')
Out[109]:
Text(0, 0.5, 'Features Importance')

Using elbow plots to identify the best # of features to use

In [110]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 22 columns):
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: float64(11), int64(5), int8(6)
memory usage: 2.6 MB

Plot R_squared and MSE for Scaled Data

In [111]:
r_squared = []
for x in range(1, len(X.columns)):
    selector = RFE(linreg, n_features_to_select=x)
    selector.fit(scaled_data, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    r_sq = linreg.score(X[X.columns[selector.support_]], Y)
    r_squared.append(r_sq)
In [112]:
# r_squared
In [113]:
from sklearn.metrics import mean_squared_error

mse=[]
for x in range(1, len(X.columns)):
    selector = RFE(linreg,  n_features_to_select=x)
    selector.fit(scaled_data, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    y_pred = linreg.predict(X[X.columns[selector.support_]])
    mse.append(mean_squared_error(Y, y_pred))
# mse
In [114]:
from sklearn.metrics import mean_squared_error

fig = plt.figure(figsize=(12 ,6))

ax1 = plt.subplot(121)
ax1.plot(range(1, len(X.columns)), r_squared)
ax1.set_ylabel('R_Squared')
ax1.set_xlabel('Number of Features')
ax1.grid()

ax2 = plt.subplot(122)
ax2.plot(range(1,len(mse)+1), mse )
ax2.set_ylabel('MSE')

ax2.set_xlabel('Number of Features',fontsize=20)
ax2.grid()

Plot R_squared and MSE for Unscaled Data

In [115]:
r_squared = []
for x in range(1, len(X.columns)):
    selector = RFE(linreg, n_features_to_select=x)
    selector.fit(X, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    r_sq = linreg.score(X[X.columns[selector.support_]], Y)
    r_squared.append(r_sq)

    
from sklearn.metrics import mean_squared_error

mse=[]
for x in range(1, len(X.columns)):
    selector = RFE(linreg,  n_features_to_select=x)
    selector.fit(X, Y)
    linreg.fit(X[X.columns[selector.support_]], Y)
    y_pred = linreg.predict(X[X.columns[selector.support_]])
    mse.append(mean_squared_error(Y, y_pred))



fig = plt.figure(figsize=(12 ,6))

ax1 = plt.subplot(121)
ax1.plot(range(1, len(X.columns)), r_squared)
ax1.set_ylabel('R_Squared')
ax1.set_xlabel('Number of Features')
ax1.grid()

ax2 = plt.subplot(122)
ax2.plot(range(1,len(mse)+1), mse )
ax2.set_ylabel('MSE')

ax2.set_xlabel('Number of Features')
ax2.grid() 

Choosing Features Based on Rankings

  • The above figure is indicating that right now my best possible R2 with the lowest # of features would be with 6 predictors (judging unscaled data).
  • Now examine the sorted best_features dataframe to see which 6 to use
In [116]:
best_features.sort_values('unscaled_rank')
Out[116]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
sca_bathrooms 18 7
is_reno 20 8
condition 13 9
sca_bedrooms 19 10
bathrooms 15 11
bedrooms 16 12
code_waterfront 22 13
codezipcode 17 14
sqft_living 2 15
sqft_living15 12 16
sqft_lot15 14 17
sqft_lot 5 18
sca_sqft_living15 7 19
sca_sqft_living 4 20
sca_sqft_lot15 6 21
sca_sqft_lot 10 22
In [117]:
# Pull out the 6 best features via ranking 
best_num_feat = 6
selected_features = best_features.sort_values('unscaled_rank')[0:best_num_feat]
selected_features
Out[117]:
scaled_rank unscaled_rank
columns
grade 1 1
code_view 8 2
codebins_sqftbasement 11 3
codebins_sqftabove 21 4
floors 9 5
codebins_yrbuilt 3 6
In [118]:
# USING UNSCALED
selected_features.index
Out[118]:
Index(['grade', 'code_view', 'codebins_sqftbasement', 'codebins_sqftabove',
       'floors', 'codebins_yrbuilt'],
      dtype='object', name='columns')
In [119]:
# Check the columns of X
X[selected_features.index].head()
Out[119]:
grade code_view codebins_sqftbasement codebins_sqftabove floors codebins_yrbuilt
0 7 0 0 2 1.0 5
1 7 0 1 4 2.0 5
2 6 0 0 1 1.0 3
3 7 0 2 2 1.0 6
4 8 0 0 3 1.0 8

PRELIMINARY UNIVARIATE LINEAR REGRESSION MODELING

In [120]:
df_run.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19053 entries, 0 to 21596
Data columns (total 27 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null int64
floors                   19053 non-null float64
condition                19053 non-null int64
grade                    19053 non-null int64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null int64
code_view                19053 non-null int8
code_waterfront          19053 non-null int8
is_reno                  19053 non-null int64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null int8
codebins_sqftbasement    19053 non-null int8
codebins_sqftabove       19053 non-null int8
codezipcode              19053 non-null int8
dtypes: category(4), float64(12), int64(5), int8(6)
memory usage: 2.8 MB
In [121]:
# Running simple linear regression for each predictor on its own
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import statsmodels.stats.api as sms


# log_price = np.log(df['price'])
# df['log_price'] = log_price

target_var = 'price'
col_names = df_run.drop(['price'],axis=1).columns

# Create results list for saving the output statstics for each predictor
results = [['ind_var', 'r_squared', 'intercept', 'slope', 'p-value' ]] 

for idx, val in enumerate(col_names): 
    
    # Use the names of the columns to determine format of forumla  
    if val.startswith('code'):
        
        df_run[val] = df_run[val].astype('category').cat.as_ordered() 
        f =f'{str(target_var)}~C({val})'
        
    elif val.startswith('bin'):
        
        df_run[val] = df_run[val].cat.as_ordered() 
        f =f'{str(target_var)}~C({val})'
        
    else:
        
        f =f'{str(target_var)}~{val}'
        
    # Run the ols models     
    model = smf.ols(formula=f, data=df_run).fit() 
    model.summary()
    
    # Append results
    results.append([val, model.rsquared, model.params[0], model.params[1], model.pvalues[1] ]) 
    
# Turn results into dataframe with correct index and columns
res_df = pd.DataFrame(results)
res_df.columns = res_df.iloc[0]
res_df=res_df[1:]
res_df.set_index('ind_var',inplace=True)
res_df.sort_values('r_squared',ascending=False)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\statsmodels\base\model.py:1100: RuntimeWarning: invalid value encountered in true_divide
  return self.params / self.bse
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:877: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:877: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\scipy\stats\_distn_infrastructure.py:1831: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
Out[121]:
r_squared intercept slope p-value
ind_var
codezipcode 0.498102 280365 -47520.3 0.0002275
zipcode 0.498102 280365 -47520.3 0.0002275
grade 0.362959 -454845 123114 0
sqft_living 0.347657 142997 168.572 0
sca_sqft_living 0.347657 451484 165200 0
sqft_living15 0.27516 118553 184.751 0
sca_sqft_living15 0.27516 449258 145953 0
codebins_sqftabove 0.225333 210050 141861 0.107357
bins_sqftabove 0.225333 210050 141861 0.107357
bathrooms 0.169802 215907 125027 0
sca_bathrooms 0.169802 465961 125027 0
sca_bedrooms 0.0717169 447998 68611.9 0
bedrooms 0.0717169 242163 68611.9 0
floors 0.0629911 330868 93475.5 1.61001e-271
code_view 0.0563913 455177 168640 1.60061e-43
bins_sqftbasement 0.0492165 442306 19843.1 3.81601e-06
codebins_sqftbasement 0.0492165 442306 19843.1 3.81601e-06
codebins_yrbuilt 0.0482952 543106 -35754.2 0.000899441
bins_yrbuilt 0.0482952 543106 -35754.2 0.000899441
is_reno 0.00765121 465065 103741 1.10517e-33
sca_sqft_lot 0.00581721 465533 2035.97 5.50156e-26
sqft_lot 0.00581721 462590 0.392892 5.50156e-26
sqft_lot15 0.00411256 461884 0.518291 7.96816e-19
sca_sqft_lot15 0.00411256 465777 2461.37 7.96816e-19
code_waterfront 0.00239512 467280 203173 1.64907e-11
condition 0.00235454 416978 14985.1 2.065e-11
In [122]:
# Initial variables for modeling
try_modeling = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living15']
# Hmm...realized there are redundant versions of variables and am just selecting the correct versions to use.

NOTES FOLLOWING PRELIMINARY LINEAR REGRESSIONS

  • Variables that had high R_square with logz_price:
    • New results = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living15']
try_modeling = try_modeling = ['codezipcode', 'grade','sca_sqft_living', 'sca_sqft_living16']

MULTIVARIATE REGRESSIONS

In [123]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import statsmodels.stats.api as sms
from sklearn.preprocessing import MinMaxScaler

# # PUTTING TOGETHER THE PREDICTORS TO RUN IN THE REGRESSION
# ## Last min dummy vars []'cat_grade','cat_zipcode','cat_view','cat_bins_sqft_above','cat_bins_sqft_basement']
# dum_grades = pd.get_dummies(df_run['cat_grade'],prefix='gr').iloc[:,:-1]
# dum_view = pd.get_dummies(df_run['cat_view'], prefix='view').iloc[:,:-1]
# dum_sqft_above = pd.get_dummies(df_run['cat_bins_sqftabove'],prefix='sqftAb').iloc[:,:-1]
# dum_sqft_base = pd.get_dummies(df_run['cat_bins_sqftbasement'],prefix='sqftBa').iloc[:,:-1]


# RUNNING K-FOLD VALIDATION WITH STATSMODELS OLS.
# X = df_run.drop(['price','logZ_price'],axis=1)
# list_predictors = ['logZ_sqft_living','logZ_sqft_living15','bedrooms','bathrooms','floors']
# scaler = MinMaxScaler()
# sel_columns = selected_features.index
In [124]:
# Define X, Y 
X = df_run[try_modeling]
# X.columns

Y = df_run['price']
# y = df_run['logZ_price']

# Get a list of predictor names string 
list_predictors = [str(x) for x in X.columns]
list_predictors.append('intercept')
list_predictors
Out[124]:
['codezipcode', 'grade', 'sca_sqft_living', 'sca_sqft_living15', 'intercept']
In [125]:
# Comcatenate X,Y for OLS
df_run_ols = pd.concat([Y,X],axis=1)

# Import packages
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import scipy.stats as stats

# Enter equation for selected predictors: (use C to run as categorical) 
# f1 = 'price~C(codezipcode)+C(grade)+sca_sqft_living+sca_sqft_living15' # 0.8 r1 Adjusted
f1 = 'price~C(codezipcode)+grade+sca_sqft_living+sca_sqft_living15' 

# Run model and report sumamry
model = smf.ols(formula=f1, data=df_run_ols).fit()
model.summary()
Out[125]:
OLS Regression Results
Dep. Variable: price R-squared: 0.797
Model: OLS Adj. R-squared: 0.796
Method: Least Squares F-statistic: 1035.
Date: Thu, 11 Apr 2019 Prob (F-statistic): 0.00
Time: 14:40:51 Log-Likelihood: -2.4441e+05
No. Observations: 19053 AIC: 4.890e+05
Df Residuals: 18980 BIC: 4.895e+05
Df Model: 72
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5152.6368 8731.030 0.590 0.555 -1.2e+04 2.23e+04
C(codezipcode)[T.1] 2.361e+04 8215.662 2.873 0.004 7503.139 3.97e+04
C(codezipcode)[T.2] 2341.0236 7378.061 0.317 0.751 -1.21e+04 1.68e+04
C(codezipcode)[T.3] 5.114e+05 8885.250 57.560 0.000 4.94e+05 5.29e+05
C(codezipcode)[T.4] 3.405e+05 9150.155 37.213 0.000 3.23e+05 3.58e+05
C(codezipcode)[T.5] 2.802e+05 6825.525 41.045 0.000 2.67e+05 2.94e+05
C(codezipcode)[T.6] 2.573e+05 9464.329 27.187 0.000 2.39e+05 2.76e+05
C(codezipcode)[T.7] 2.557e+05 7457.922 34.291 0.000 2.41e+05 2.7e+05
C(codezipcode)[T.8] 1.054e+05 1.05e+04 10.074 0.000 8.49e+04 1.26e+05
C(codezipcode)[T.9] 1.348e+05 8258.863 16.319 0.000 1.19e+05 1.51e+05
C(codezipcode)[T.10] 1.282e+05 9852.210 13.014 0.000 1.09e+05 1.48e+05
C(codezipcode)[T.11] 8.965e+04 8371.036 10.710 0.000 7.32e+04 1.06e+05
C(codezipcode)[T.12] 4.869e+04 7811.281 6.234 0.000 3.34e+04 6.4e+04
C(codezipcode)[T.13] -2.103e+04 6398.367 -3.287 0.001 -3.36e+04 -8487.932
C(codezipcode)[T.14] 1.723e+05 1.21e+04 14.294 0.000 1.49e+05 1.96e+05
C(codezipcode)[T.15] 1.893e+05 6900.034 27.438 0.000 1.76e+05 2.03e+05
C(codezipcode)[T.16] 1.282e+05 7366.804 17.401 0.000 1.14e+05 1.43e+05
C(codezipcode)[T.17] 2.198e+05 7223.982 30.430 0.000 2.06e+05 2.34e+05
C(codezipcode)[T.18] 907.7456 7534.878 0.120 0.904 -1.39e+04 1.57e+04
C(codezipcode)[T.19] 1.147e+04 7391.085 1.552 0.121 -3015.547 2.6e+04
C(codezipcode)[T.20] 3234.7083 9600.539 0.337 0.736 -1.56e+04 2.21e+04
C(codezipcode)[T.21] 3.27e+05 6873.901 47.578 0.000 3.14e+05 3.41e+05
C(codezipcode)[T.22] 1.858e+05 6322.193 29.392 0.000 1.73e+05 1.98e+05
C(codezipcode)[T.23] 3.341e+04 6189.647 5.398 0.000 2.13e+04 4.55e+04
C(codezipcode)[T.24] 6.446e+05 3.72e+04 17.333 0.000 5.72e+05 7.17e+05
C(codezipcode)[T.25] 4.396e+05 8998.802 48.846 0.000 4.22e+05 4.57e+05
C(codezipcode)[T.26] 1.342e+04 6249.489 2.147 0.032 1166.174 2.57e+04
C(codezipcode)[T.27] 1.194e+05 7939.547 15.036 0.000 1.04e+05 1.35e+05
C(codezipcode)[T.28] 2.492e+05 6292.625 39.602 0.000 2.37e+05 2.62e+05
C(codezipcode)[T.29] 2.347e+05 6937.430 33.827 0.000 2.21e+05 2.48e+05
C(codezipcode)[T.30] 4.06e+04 7516.748 5.402 0.000 2.59e+04 5.53e+04
C(codezipcode)[T.31] 1.093e+05 6693.020 16.330 0.000 9.62e+04 1.22e+05
C(codezipcode)[T.32] 3.932e+04 6534.860 6.018 0.000 2.65e+04 5.21e+04
C(codezipcode)[T.33] 9.51e+04 6589.751 14.431 0.000 8.22e+04 1.08e+05
C(codezipcode)[T.34] 1.201e+05 7412.001 16.199 0.000 1.06e+05 1.35e+05
C(codezipcode)[T.35] 1.975e+05 9885.649 19.982 0.000 1.78e+05 2.17e+05
C(codezipcode)[T.36] 1.742e+05 7612.251 22.880 0.000 1.59e+05 1.89e+05
C(codezipcode)[T.37] 2.137e+05 6770.358 31.559 0.000 2e+05 2.27e+05
C(codezipcode)[T.38] 2.286e+05 7438.126 30.732 0.000 2.14e+05 2.43e+05
C(codezipcode)[T.39] 1.781e+05 8684.281 20.503 0.000 1.61e+05 1.95e+05
C(codezipcode)[T.40] -1.148e+04 6965.141 -1.648 0.099 -2.51e+04 2173.340
C(codezipcode)[T.41] 4.05e+05 1.11e+04 36.408 0.000 3.83e+05 4.27e+05
C(codezipcode)[T.42] 3.297e+05 6239.985 52.830 0.000 3.17e+05 3.42e+05
C(codezipcode)[T.43] 3.908e+05 8585.748 45.522 0.000 3.74e+05 4.08e+05
C(codezipcode)[T.44] 1.203e+05 7095.802 16.956 0.000 1.06e+05 1.34e+05
C(codezipcode)[T.45] 3.252e+05 7546.464 43.096 0.000 3.1e+05 3.4e+05
C(codezipcode)[T.46] 1.193e+05 8417.753 14.169 0.000 1.03e+05 1.36e+05
C(codezipcode)[T.47] 4.236e+05 1.13e+04 37.566 0.000 4.01e+05 4.46e+05
C(codezipcode)[T.48] 4.364e+05 8724.465 50.019 0.000 4.19e+05 4.53e+05
C(codezipcode)[T.49] 3.271e+05 6266.348 52.207 0.000 3.15e+05 3.39e+05
C(codezipcode)[T.50] 3.16e+05 7192.406 43.936 0.000 3.02e+05 3.3e+05
C(codezipcode)[T.51] 3.237e+05 6318.347 51.230 0.000 3.11e+05 3.36e+05
C(codezipcode)[T.52] 1.744e+05 6436.099 27.102 0.000 1.62e+05 1.87e+05
C(codezipcode)[T.53] 4.177e+05 9094.361 45.931 0.000 4e+05 4.36e+05
C(codezipcode)[T.54] 3.163e+05 7509.896 42.116 0.000 3.02e+05 3.31e+05
C(codezipcode)[T.55] 2.019e+05 6737.030 29.975 0.000 1.89e+05 2.15e+05
C(codezipcode)[T.56] 2.127e+05 6999.314 30.390 0.000 1.99e+05 2.26e+05
C(codezipcode)[T.57] 1.563e+05 6413.853 24.371 0.000 1.44e+05 1.69e+05
C(codezipcode)[T.58] 2.722e+05 7617.985 35.728 0.000 2.57e+05 2.87e+05
C(codezipcode)[T.59] 2.5e+05 7152.431 34.956 0.000 2.36e+05 2.64e+05
C(codezipcode)[T.60] 1.264e+05 7419.402 17.037 0.000 1.12e+05 1.41e+05
C(codezipcode)[T.61] 5.933e+04 1.32e+04 4.489 0.000 3.34e+04 8.52e+04
C(codezipcode)[T.62] 1.486e+05 6562.321 22.643 0.000 1.36e+05 1.61e+05
C(codezipcode)[T.63] 1.299e+05 7685.234 16.905 0.000 1.15e+05 1.45e+05
C(codezipcode)[T.64] 5.063e+04 7497.228 6.753 0.000 3.59e+04 6.53e+04
C(codezipcode)[T.65] 2.276e+05 7840.903 29.027 0.000 2.12e+05 2.43e+05
C(codezipcode)[T.66] 7.063e+04 7516.470 9.396 0.000 5.59e+04 8.54e+04
C(codezipcode)[T.67] 3.793e+04 9456.808 4.011 0.000 1.94e+04 5.65e+04
C(codezipcode)[T.68] 4.591e+04 7416.408 6.191 0.000 3.14e+04 6.05e+04
C(codezipcode)[T.69] 3.666e+05 7441.021 49.269 0.000 3.52e+05 3.81e+05
grade 3.645e+04 1004.395 36.289 0.000 3.45e+04 3.84e+04
sca_sqft_living 1.058e+05 1483.602 71.280 0.000 1.03e+05 1.09e+05
sca_sqft_living15 3.646e+04 1505.741 24.213 0.000 3.35e+04 3.94e+04
Omnibus: 2745.646 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8873.079
Skew: 0.739 Prob(JB): 0.00
Kurtosis: 5.998 Cond. No. 491.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Cross-Validation with K-Fold Test-Train Splits:

- f1 = 'price ~ C(codezipcode) + grade + sca_sqft_living + sca_sqft_living15'
  • price ~ zipcode(category), grade(category), sqft_living(scaled to median, RobustScaler) + sqft_living15 (scaled to median, RobustScaler)

Save df_run_ols to csv

In [126]:
# df_run_ols.to_csv(data_filepath+'df_run_ols_model.csv')
In [127]:
# Visualize Q-Q Plots
resid1=model.resid
fig = sm.graphics.qqplot(resid1, dist=stats.norm, line='45', fit=True,marker='.')
  • Interpreting the Q-Q plot:
    • The Q-Q plot looks a bit crazy and may indicate... outliers?
    • The only thing I did not check for outliers in final model was zipcode.
    • Will run cross-validation with test-train-split to help decide
In [128]:
# Visualizing final dataset again.
import matplotlib.pyplot as plt

# Re-inspecting XY
# plot_hist_scat(df_run_ols)

FINAL REGRESSION RESULTS

K-Fold valiation with OLS

In [129]:
# k_fold_val_ols(X,y,k=10):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics 

y = df_run['price']


# Run 10-fold cross validation
results = [['set#','R_square_train','MSE_train','R_square_test','MSE_test']]

num_coeff = X.shape[1]

list_predictors = [str(x) for x in X.columns]
list_predictors.append('intercept') 

reg_params = [list_predictors]

i=0
k=10
while i <(k+1):
    X_train, X_test, y_train, y_test = train_test_split(X,y) #,stratify=[cat_col_names])

    data = pd.concat([X_train,y_train],axis=1)
    f1 = 'price~C(codezipcode)+grade+sca_sqft_living+sca_sqft_living15' 
    model = smf.ols(formula=f1, data=data).fit()
    model.summary()
    
    y_hat_train = model.predict(X_train)
    y_hat_test = model.predict(X_test)

    train_residuals = y_hat_train - y_train
    test_residuals = y_hat_test - y_test

        
    train_mse = metrics.mean_squared_error(y_train, y_hat_train)
    test_mse = metrics.mean_squared_error(y_test, y_hat_test)

    R_sqare_train = metrics.r2_score(y_train,y_hat_train)
    R_square_test = metrics.r2_score(y_test,y_hat_test)

    results.append([i,R_sqare_train,train_mse,R_square_test,test_mse])
    i+=1

    
results = pd.DataFrame(results[1:],columns=results[0])
results
model.summary()
Out[129]:
OLS Regression Results
Dep. Variable: price R-squared: 0.798
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 779.6
Date: Thu, 11 Apr 2019 Prob (F-statistic): 0.00
Time: 14:40:54 Log-Likelihood: -1.8330e+05
No. Observations: 14289 AIC: 3.667e+05
Df Residuals: 14216 BIC: 3.673e+05
Df Model: 72
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6635.8178 1.01e+04 0.659 0.510 -1.31e+04 2.64e+04
C(codezipcode)[T.1] 2.369e+04 9309.639 2.545 0.011 5440.509 4.19e+04
C(codezipcode)[T.2] -1301.4114 8621.597 -0.151 0.880 -1.82e+04 1.56e+04
C(codezipcode)[T.3] 4.977e+05 1.02e+04 48.962 0.000 4.78e+05 5.18e+05
C(codezipcode)[T.4] 3.375e+05 1.06e+04 31.955 0.000 3.17e+05 3.58e+05
C(codezipcode)[T.5] 2.799e+05 8005.623 34.961 0.000 2.64e+05 2.96e+05
C(codezipcode)[T.6] 2.596e+05 1.08e+04 24.008 0.000 2.38e+05 2.81e+05
C(codezipcode)[T.7] 2.614e+05 8784.961 29.753 0.000 2.44e+05 2.79e+05
C(codezipcode)[T.8] 1.04e+05 1.22e+04 8.497 0.000 8e+04 1.28e+05
C(codezipcode)[T.9] 1.331e+05 9683.166 13.745 0.000 1.14e+05 1.52e+05
C(codezipcode)[T.10] 1.237e+05 1.08e+04 11.412 0.000 1.02e+05 1.45e+05
C(codezipcode)[T.11] 8.721e+04 9445.059 9.233 0.000 6.87e+04 1.06e+05
C(codezipcode)[T.12] 4.618e+04 9050.352 5.103 0.000 2.84e+04 6.39e+04
C(codezipcode)[T.13] -2.229e+04 7406.562 -3.009 0.003 -3.68e+04 -7771.151
C(codezipcode)[T.14] 1.675e+05 1.36e+04 12.296 0.000 1.41e+05 1.94e+05
C(codezipcode)[T.15] 1.898e+05 7935.543 23.923 0.000 1.74e+05 2.05e+05
C(codezipcode)[T.16] 1.289e+05 8408.627 15.335 0.000 1.12e+05 1.45e+05
C(codezipcode)[T.17] 2.175e+05 8368.403 25.992 0.000 2.01e+05 2.34e+05
C(codezipcode)[T.18] -789.4975 8671.290 -0.091 0.927 -1.78e+04 1.62e+04
C(codezipcode)[T.19] 1.155e+04 8529.494 1.354 0.176 -5171.235 2.83e+04
C(codezipcode)[T.20] 2203.1479 1.11e+04 0.199 0.842 -1.95e+04 2.39e+04
C(codezipcode)[T.21] 3.194e+05 7943.032 40.207 0.000 3.04e+05 3.35e+05
C(codezipcode)[T.22] 1.844e+05 7338.080 25.128 0.000 1.7e+05 1.99e+05
C(codezipcode)[T.23] 3.13e+04 7160.140 4.372 0.000 1.73e+04 4.53e+04
C(codezipcode)[T.24] 6.125e+05 4.55e+04 13.452 0.000 5.23e+05 7.02e+05
C(codezipcode)[T.25] 4.396e+05 1.03e+04 42.706 0.000 4.19e+05 4.6e+05
C(codezipcode)[T.26] 9562.1926 7287.218 1.312 0.189 -4721.709 2.38e+04
C(codezipcode)[T.27] 1.093e+05 9254.643 11.810 0.000 9.12e+04 1.27e+05
C(codezipcode)[T.28] 2.516e+05 7287.621 34.518 0.000 2.37e+05 2.66e+05
C(codezipcode)[T.29] 2.342e+05 7982.635 29.333 0.000 2.19e+05 2.5e+05
C(codezipcode)[T.30] 4.599e+04 8842.087 5.201 0.000 2.87e+04 6.33e+04
C(codezipcode)[T.31] 1.091e+05 7687.762 14.190 0.000 9.4e+04 1.24e+05
C(codezipcode)[T.32] 3.814e+04 7566.685 5.041 0.000 2.33e+04 5.3e+04
C(codezipcode)[T.33] 9.232e+04 7604.981 12.140 0.000 7.74e+04 1.07e+05
C(codezipcode)[T.34] 1.197e+05 8633.858 13.859 0.000 1.03e+05 1.37e+05
C(codezipcode)[T.35] 2.173e+05 1.12e+04 19.322 0.000 1.95e+05 2.39e+05
C(codezipcode)[T.36] 1.694e+05 8804.154 19.239 0.000 1.52e+05 1.87e+05
C(codezipcode)[T.37] 2.127e+05 7720.386 27.557 0.000 1.98e+05 2.28e+05
C(codezipcode)[T.38] 2.213e+05 8537.817 25.919 0.000 2.05e+05 2.38e+05
C(codezipcode)[T.39] 1.712e+05 9988.865 17.140 0.000 1.52e+05 1.91e+05
C(codezipcode)[T.40] -1.169e+04 8011.544 -1.459 0.145 -2.74e+04 4015.223
C(codezipcode)[T.41] 3.956e+05 1.27e+04 31.064 0.000 3.71e+05 4.21e+05
C(codezipcode)[T.42] 3.35e+05 7197.943 46.547 0.000 3.21e+05 3.49e+05
C(codezipcode)[T.43] 3.934e+05 9645.659 40.783 0.000 3.74e+05 4.12e+05
C(codezipcode)[T.44] 1.171e+05 8199.668 14.286 0.000 1.01e+05 1.33e+05
C(codezipcode)[T.45] 3.249e+05 8639.938 37.599 0.000 3.08e+05 3.42e+05
C(codezipcode)[T.46] 1.129e+05 9742.804 11.590 0.000 9.38e+04 1.32e+05
C(codezipcode)[T.47] 4.264e+05 1.26e+04 33.761 0.000 4.02e+05 4.51e+05
C(codezipcode)[T.48] 4.353e+05 1.03e+04 42.415 0.000 4.15e+05 4.55e+05
C(codezipcode)[T.49] 3.293e+05 7218.682 45.612 0.000 3.15e+05 3.43e+05
C(codezipcode)[T.50] 3.172e+05 8166.213 38.845 0.000 3.01e+05 3.33e+05
C(codezipcode)[T.51] 3.233e+05 7326.388 44.132 0.000 3.09e+05 3.38e+05
C(codezipcode)[T.52] 1.759e+05 7422.859 23.692 0.000 1.61e+05 1.9e+05
C(codezipcode)[T.53] 4.262e+05 1.07e+04 39.830 0.000 4.05e+05 4.47e+05
C(codezipcode)[T.54] 3.164e+05 8622.209 36.697 0.000 3e+05 3.33e+05
C(codezipcode)[T.55] 2.031e+05 7725.040 26.291 0.000 1.88e+05 2.18e+05
C(codezipcode)[T.56] 2.129e+05 8054.593 26.434 0.000 1.97e+05 2.29e+05
C(codezipcode)[T.57] 1.539e+05 7425.533 20.730 0.000 1.39e+05 1.68e+05
C(codezipcode)[T.58] 2.692e+05 8766.764 30.707 0.000 2.52e+05 2.86e+05
C(codezipcode)[T.59] 2.476e+05 8270.468 29.935 0.000 2.31e+05 2.64e+05
C(codezipcode)[T.60] 1.296e+05 8590.723 15.086 0.000 1.13e+05 1.46e+05
C(codezipcode)[T.61] 5.614e+04 1.46e+04 3.847 0.000 2.75e+04 8.48e+04
C(codezipcode)[T.62] 1.505e+05 7596.333 19.810 0.000 1.36e+05 1.65e+05
C(codezipcode)[T.63] 1.274e+05 8950.136 14.236 0.000 1.1e+05 1.45e+05
C(codezipcode)[T.64] 5.384e+04 8806.491 6.114 0.000 3.66e+04 7.11e+04
C(codezipcode)[T.65] 2.214e+05 9359.338 23.654 0.000 2.03e+05 2.4e+05
C(codezipcode)[T.66] 7.043e+04 8676.044 8.118 0.000 5.34e+04 8.74e+04
C(codezipcode)[T.67] 3.955e+04 1.11e+04 3.576 0.000 1.79e+04 6.12e+04
C(codezipcode)[T.68] 4.308e+04 8713.418 4.944 0.000 2.6e+04 6.02e+04
C(codezipcode)[T.69] 3.599e+05 8619.687 41.756 0.000 3.43e+05 3.77e+05
grade 3.637e+04 1160.376 31.340 0.000 3.41e+04 3.86e+04
sca_sqft_living 1.076e+05 1726.171 62.363 0.000 1.04e+05 1.11e+05
sca_sqft_living15 3.647e+04 1737.962 20.987 0.000 3.31e+04 3.99e+04
Omnibus: 2103.968 Durbin-Watson: 2.036
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6834.868
Skew: 0.753 Prob(JB): 0.00
Kurtosis: 6.036 Cond. No. 498.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [130]:
results
Out[130]:
set# R_square_train MSE_train R_square_test MSE_test
0 0 0.797243 8.030139e+09 0.795109 8.428913e+09
1 1 0.797210 8.048519e+09 0.795121 8.374950e+09
2 2 0.796630 8.158708e+09 0.796859 8.044344e+09
3 3 0.799855 7.943015e+09 0.787515 8.689515e+09
4 4 0.798647 8.094723e+09 0.790775 8.231387e+09
5 5 0.797357 8.148087e+09 0.794719 8.072952e+09
6 6 0.797534 8.123376e+09 0.794343 8.141397e+09
7 7 0.793946 8.132617e+09 0.804649 8.115010e+09
8 8 0.797634 8.082581e+09 0.794066 8.262846e+09
9 9 0.797798 8.064640e+09 0.793344 8.328589e+09
10 10 0.797918 8.119989e+09 0.792454 8.172863e+09

Q-Q Plots

In [131]:
resid1=model.resid
fig = sm.graphics.qqplot(resid1, dist=stats.norm, line='45', fit=True,marker='.')
In [132]:
df_model = pd.concat([df_run[try_modeling],df_run['price']],axis=1)
# df_model.to_csv(data_filepath+'df_housing_model.csv')
# df_run.to_csv(data_filepath+'df_house_all_final_data.csv')
In [133]:
results.describe()
Out[133]:
set# R_square_train MSE_train R_square_test MSE_test
count 11.000000 11.000000 1.100000e+01 11.000000 1.100000e+01
mean 5.000000 0.797434 8.086036e+09 0.794450 8.260251e+09
std 3.316625 0.001437 6.275401e+07 0.004216 1.886623e+08
min 0.000000 0.793946 7.943015e+09 0.787515 8.044344e+09
25% 2.500000 0.797227 8.056580e+09 0.792899 8.128204e+09
50% 5.000000 0.797534 8.094723e+09 0.794343 8.231387e+09
75% 7.500000 0.797858 8.127997e+09 0.795115 8.351770e+09
max 10.000000 0.799855 8.158708e+09 0.804649 8.689515e+09

FINAL MODEL - New

  • For k=10 fold validation, with price as target variable:

    • mean r_squared for the test sets was 0.797, with mean MSE = 8.158691e+09

    f1 = 'price~C(codezipcode)+C(grade)+sca_sqft_living+sca_sqft_living15'

  • Predictors in final model:

    • 'Zipcode'
    • 'grade'
    • 'sqft_living'
    • 'sqfr_living15'
  • My final model indicates that the size, location, and housing geade to be critical components in determining salesprice.

Predictor Coefficients & Their Affect On Sales Price

  • Grade, sqft_living, and sqft_living15 all have straight-forward relationships with sales price, with positive coefficients.
    • sqft_living is a larger component of the price (coefficient: 1.043e+05)
    • grade (coeff: 3.679e+04)and sqft_living15 (coeff: 3.767e+04) have a similar magnitude of an effect on sales price
  • Zipcode is a bit trickier, as each zipcode has its own coefficient.
    • Overall, zipcode has a positive coefficient/effect on price (to varying degrees)
    • There are, however, a couple zipcodes that negatively impact sales price.

Future Directions

  • With more time I would have proceeded to perform the following steps:
    • Additional changes to the predictors in the model
      • Trying to remove outliers from zipcodes
    • Explored additional transformations to the data.
      • I was trying to be conservative to keep the interpretability of my model intact.
      • Log-transforming the data improved the distributions but made it more difficult to interpret.
    • I would further tweak the quality of the visuals, particularly changing the x-tick labels and rotation.
In [134]:
df_dropped.index
Out[134]:
RangeIndex(start=0, stop=21597, step=1)
In [135]:
df_run.describe()
Out[135]:
price bedrooms bathrooms sqft_living sqft_lot floors condition grade sqft_living15 sqft_lot15 is_reno sca_sqft_living sca_sqft_lot sca_sqft_living15 sca_sqft_lot15 sca_bedrooms sca_bathrooms
count 1.905300e+04 19053.000000 19053.000000 19053.000000 1.905300e+04 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000 19053.000000
mean 4.681033e+05 3.293025 2.017136 1928.589513 1.403172e+04 1.468142 3.411746 7.496720 1892.003464 12000.351703 0.029287 0.100602 1.262394 0.129118 0.945326 0.293025 0.017136
std 1.999766e+05 0.780532 0.659094 699.470303 3.882067e+04 0.536934 0.647548 0.978593 567.785762 24743.514984 0.168613 0.713745 7.491446 0.718716 5.210258 0.780532 0.659094
min 7.800000e+04 2.000000 0.750000 440.000000 5.200000e+02 1.000000 1.000000 4.000000 399.000000 651.000000 0.000000 -1.418367 -1.345041 -1.760759 -1.444515 -1.000000 -1.250000
25% 3.135000e+05 3.000000 1.500000 1400.000000 5.000000e+03 1.000000 3.000000 7.000000 1460.000000 5026.000000 0.000000 -0.438776 -0.480509 -0.417722 -0.523268 0.000000 -0.500000
50% 4.310000e+05 3.000000 2.000000 1830.000000 7.490000e+03 1.000000 3.000000 7.000000 1790.000000 7511.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 5.890000e+05 4.000000 2.500000 2380.000000 1.018200e+04 2.000000 4.000000 8.000000 2250.000000 9775.000000 0.000000 0.561224 0.519491 0.582278 0.476732 1.000000 0.500000
max 1.120000e+06 5.000000 3.500000 4230.000000 1.651359e+06 3.500000 5.000000 11.000000 3660.000000 560617.000000 1.000000 2.448980 317.226746 2.367089 116.467888 2.000000 1.500000
In [136]:
import pandas as pd

df_final_data=pd.concat([df_run, df_dropped[['lat','long','id']]],axis=1)
df_final_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21597 entries, 0 to 21596
Data columns (total 30 columns):
price                    19053 non-null float64
bedrooms                 19053 non-null float64
bathrooms                19053 non-null float64
sqft_living              19053 non-null float64
sqft_lot                 19053 non-null float64
floors                   19053 non-null float64
condition                19053 non-null float64
grade                    19053 non-null float64
zipcode                  19053 non-null category
sqft_living15            19053 non-null float64
sqft_lot15               19053 non-null float64
code_view                19053 non-null category
code_waterfront          19053 non-null category
is_reno                  19053 non-null float64
bins_yrbuilt             19053 non-null category
bins_sqftbasement        19053 non-null category
bins_sqftabove           19053 non-null category
sca_sqft_living          19053 non-null float64
sca_sqft_lot             19053 non-null float64
sca_sqft_living15        19053 non-null float64
sca_sqft_lot15           19053 non-null float64
sca_bedrooms             19053 non-null float64
sca_bathrooms            19053 non-null float64
codebins_yrbuilt         19053 non-null category
codebins_sqftbasement    19053 non-null category
codebins_sqftabove       19053 non-null category
codezipcode              19053 non-null category
lat                      21597 non-null float64
long                     21597 non-null float64
id                       21597 non-null int64
dtypes: category(10), float64(19), int64(1)
memory usage: 3.7 MB
In [137]:
# save final output
df_final_data.to_csv(data_filepath+'kc_housing_model_df_final_data.csv')

SUMMARY FIGURE CODE FOR PRESENTATION

In [138]:
# Reset the visual style of the notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams.update(inline_rc)

# inline_rc = dict(mpl.rcParams)
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:855: MatplotlibDeprecationWarning: 
examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
C:\Users\james\Anaconda3\envs\learn-env\lib\site-packages\matplotlib\__init__.py:846: MatplotlibDeprecationWarning: 
The text.latex.unicode rcparam was deprecated in Matplotlib 2.2 and will be removed in 3.1.
  "2.2", name=key, obj_type="rcparam", addendum=addendum)
In [139]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
plt.style.use('dark_background')
# plt.style.use('dark')


# Define the figure and axes and the # of subplots, sharing the y axes
fig, ax = plt.subplots(figsize=(16,12), ncols=2, nrows=2, sharey=True)

## Defining Formatting to be Used

# Formatting dollar sign labels
fmtPrice = '${x:,.0f}'
tickPrice = mtick.StrMethodFormatter(fmtPrice)

# Axis Label fonts
fontTitle = {'fontsize': 20,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontAxis = {'fontsize': 16,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontTicks = {'fontsize': 12,
           'fontweight':'medium',
            'fontfamily':'serif'}

# The amount of space above titles
y_title_margin = 1.01

# Major title
# plt.suptitle("Critical Factors for Predicting Sales Price", y = 1.0, fontdict=fontTitle, fontsize=22)


## Subplot 1
i,j=0,0
ax[i,j].set_title("Zipcode",y = y_title_margin,fontdict=fontTitle)#, y = y_title_margin)
sns.stripplot(df_final_data['zipcode'],df_final_data['price'],ax=ax[i,j],marker='o',size=3)

# Remove xticks
ax[i,j].set_xticks([]), ax[i,j].set_xlabel('')

# Change y-tick labels
ax[i,j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice)

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')



## Subplot 2
i,j = 0,1
ax[i,j].set_title("Housing Grade",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)
sns.stripplot(df_final_data['grade'],df_final_data['price'],ax=ax[i,j],marker='o',size=2)

#Set x axis
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks)
ax[i,j].set_xlabel('Grade')


# Change y-tick labels
ax[i,j].set_ylabel('')# 'Price',fontdict=fontAxis)

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')


yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 


## Subplot 3
i,j = 1,0

# Title
ax[i,j].set_title("Living Space (sqft)",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)

# Define the scatter plot and line graph aesthetics
line_kws={"color":"orange","alpha":0.5,"lw":8,"ls":":"}
scatter_kws={'s': 5, 'alpha': 0.5,'marker':'.','color':'red'}


# Plot seaborn plot 
sns.regplot(df_final_data['sqft_living'], df_final_data['price'],ax=ax[i,j], scatter_kws=scatter_kws, line_kws=line_kws) #,marker='o',size=2) 
# sns.stripplot(df_final_data['sqft_living'], df_final_data['price'],ax=ax[i,j],marker='.') #,marker='o',size=2)

## Change the x-axis 
ax[i,j].set_xlabel('Area (sqft)',fontdict=fontAxis)

# Get ticks, rotate labels, and return
# xticks = ax[i,j].get_xticks()
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks, rotation=45)

# Change the major units of x-axis
ax[i,j].xaxis.set_major_locator(mtick.MultipleLocator(500))
ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

## Change y-axis
# Change y-tick labels
ax[i,j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')



# ## Subplot 4
i,j = 1,1
ax[i,j].set_title("Neighbor's Living Space (sqft)",y = y_title_margin,fontdict=fontTitle)
ax[i,j].title.set_fontsize(20)

# Define the scatter plot and line graph aesthetics
line_kws={"color":"lime","alpha":0.5,"lw":8,"ls":":"}
scatter_kws={'s': 5, 'alpha': 0.5,'marker':'.','color':'blueviolet'}

# Plot seaborn plot 
sns.regplot(df_final_data['sqft_living15'], df_final_data['price'],ax=ax[i,j], scatter_kws=scatter_kws, line_kws=line_kws)

# Change the x-axis labels
ax[i,j].set_xlabel('Area (sqft)',fontdict=fontAxis)

# Get ticks, rotate labels, and return
xticklab=ax[i,j].get_xticklabels()
ax[i,j].set_xticklabels(xticklab,fontdict=fontTicks, rotation=45)

# Change the major units of x-axis
ax[i,j].xaxis.set_major_locator(mtick.MultipleLocator(500))
ax[i,j].xaxis.set_major_formatter(mtick.ScalarFormatter())

# Change y-tick labels
ax[i,j].set_ylabel('')#Price',fontdict=fontAxis)

yticklab = ax[i,j].get_yticklabels()
ax[i,j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tickPrice) 

# Set y-grid
ax[i, j].set_axisbelow(True)
ax[i, j].grid(axis='y',ls='--')

plt.tight_layout()
plt.savefig(fig_filepath+"summary_figure.png") # save as png
In [140]:
from IPython.display import Image
Image("Final Figures/map_median_price.png")
Out[140]:
In [141]:
Image("Final Figures/map_latlong_price.png")
Out[141]:
In [142]:
# plt.savefig(fig_filepath+"summary_figure.png") # save as png


END OF PROJECT 1 RE-SUBMISSION



import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
plt.style.use('dark_background')

# figSum, ax = plt.subplots(nrows=2,ncols=2,figsize=(16,16),sharey=True)
fig, ax = plt.subplots(figsize=(16,12), ncols=2, nrows=2, sharey=True)


# Formatting dollar sign labels
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)

# Axis Label fonts
fontTitle = {'fontsize': 20,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontAxis = {'fontsize': 16,
           'fontweight': 'bold',
            'fontfamily':'serif'}

fontTicks = {'fontsize': 12,
           'fontweight':'medium',
            'fontfamily':'serif'}


# The amount of space above titles
y_title_margin = 1.01

# Major title
# plt.suptitle("Critical Factors for Predicting Sales Price", y = 1.0, fontdict=fontTitle, fontsize=22)

## Subplot 1
i,j=0,0
ax[i,j].set_title( "Zipcode", y = y_title_margin, fontdict = fontTitle)

sns.stripplot(df_run_ols['codezipcode'],df_run_ols['price'],ax=ax[i,j],marker='o',size=3,jitter=True)

# Remove xticks
ax[i,j].set_xticks([]), ax[i][j].set_xlabel('')

# Change y-tick labels
ax[i][j].set_ylabel('Price',fontdict=fontAxis)

yticklab = ax[i][j].get_yticklabels()
ax[i][j].set_yticklabels(yticklab,fontdict=fontTicks)
ax[i,j].get_yaxis().set_major_formatter(tick)